library(ggplot2)
library(dplyr)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The dataset contains 1599 observations of 11 quantitative input variables, plus a row index (X). The output, quality, is a qualitative (ordinal) variable stored as an integer.
From the summary and the quality histogram, the quality distribution looks roughly bell-shaped, with a mean between 5 and 6 and a median of 6. There are no records for quality 9 or 10, so quality 8 is the highest grade of red wine in the dataset, and the lowest is 3. Most data points have quality 5 or 6, with very few at the other grades.
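The imbalance across grades can also be checked numerically rather than read off the histogram. A minimal sketch, assuming a data frame named wine with an integer quality column; the data here is a simulated stand-in with illustrative counts, not the real dataset:

```r
# Simulated stand-in for the wine data; these counts are illustrative only.
wine <- data.frame(quality = rep(3:8, times = c(10, 50, 680, 640, 200, 20)))

grade_counts <- table(wine$quality)        # number of observations per grade
grade_share  <- prop.table(grade_counts)   # fraction of observations per grade

print(grade_counts)
print(round(grade_share, 3))
```

Run against the real data, the same two calls show how few observations support any conclusion about grades 3, 4 and 8.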
It would be interesting to see how each variable affects the quality of wine.
Let’s study the impact of each variable on quality in depth, based on the dataset description and the data.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
Fixed acidity seems to fall in a similar range for all quality grades of red wine. It would help to visualize the distributions with box plots and scatter plots.
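Per-grade summaries in the format shown above are what base R's by() prints when a variable is split by quality. A minimal sketch, assuming the data frame is named wine (as the output labels suggest); the values are simulated, so the numbers will not match the real output:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(42)
quality <- rep(3:8, times = c(10, 50, 680, 640, 200, 20))
wine <- data.frame(
  quality       = quality,
  fixed.acidity = round(rnorm(length(quality), mean = 8.3, sd = 1.7), 1)
)

# Six-number summary of fixed.acidity within each quality grade.
acidity_by_grade <- by(wine$fixed.acidity, wine$quality, summary)
print(acidity_by_grade)

# Compact alternative: one median per grade, convenient for spotting trends.
medians <- tapply(wine$fixed.acidity, wine$quality, median)
print(medians)
```

The tapply() form is easier to scan when only one statistic per grade is of interest.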
Fixed acidity falls in similar ranges across grades. In the box plots, fixed.acidity for quality 8 and quality 3 looks very similar. It doesn’t look like fixed.acidity has any impact on quality.
As the plot above shows, there are a lot of data points for quality 5 and 6 and very few for quality 3, 4 and 8. The histograms for quality 5 and 6 also look quite similar.
It looks like the higher the volatile.acidity, the lower the quality, and vice versa. This agrees with the dataset description: a higher amount of acetic acid in wine leads to an unpleasant, vinegar taste. This makes volatile acidity one of the better indicators of wine quality. Although some lower-quality wines have low volatile.acidity, those data points could be affected by other variables.
This histogram also shows few data points for quality 3, 4 and 8 and many for 5 and 6. We can also see the median shift gradually between quality 5, 6 and 7.
In general, higher citric.acid seems to be good for wine. A range of 0.3 to 0.53 seems to be a good amount of citric.acid in wine.
The citric acid distribution seems to be bimodal for quality 5, 6 and 7. The second peak also seems to shift to the right, from 0.25 for quality 5 to 0.4 for quality 7.
The interquartile bands are very close together. It would be good to limit the range and observe them.
Residual sugar seems to be at similar concentrations for each quality grade. There are no sweet wines with a sugar concentration greater than 45 grams/liter in this dataset (the maximum is 15.5). It would be interesting to see how sweet red wines are rated.
As seen before, most of the data points are for quality 5 and 6. Both of these histograms look roughly normal, with a peak at 2.
The outliers lie far outside the interquartile range. We need to limit the range to observe how the interquartile range changes across quality grades.
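When limiting the range of a box plot in ggplot2, it matters how the limit is applied: coord_cartesian() only zooms the view, whereas ylim() or scale limits drop the out-of-range observations before the box statistics are computed, which changes the medians and quartiles being compared. A minimal sketch, assuming a zoom window at the 5th and 95th percentiles (an arbitrary choice made for illustration) and simulated chlorides values:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(7)
quality <- rep(3:8, times = c(10, 50, 680, 640, 200, 20))
wine <- data.frame(
  quality   = quality,
  chlorides = rlnorm(length(quality), meanlog = log(0.08), sdlog = 0.4)
)

# Zoom window: 5th to 95th percentile of the variable (an assumed choice).
y_limits <- unname(quantile(wine$chlorides, probs = c(0.05, 0.95)))

# Build the plot only if ggplot2 is installed; call print(p) to draw it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(wine, ggplot2::aes(x = factor(quality), y = chlorides)) +
    ggplot2::geom_boxplot() +
    ggplot2::coord_cartesian(ylim = y_limits) +  # zooms without discarding data
    ggplot2::labs(x = "quality", y = "chlorides")
}
```

Because the boxes are still computed from all observations, the zoomed medians stay comparable with the per-grade summaries.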
The median seems to shift downward from quality 3 to 8. However, the interquartile ranges overlap heavily across all quality grades. The range also narrows from lower to higher quality, which may be due to the lack of data points.
Most data points across the quality grades lie below 0.2, with a peak at 0.1.
This is an odd plot, with the median increasing from quality 3 to 5 and decreasing from quality 5 to 8. Maybe the histogram can help analyze the distribution.
The odd shift in the median seems to be due to the lack of data points. The histograms for quality 5 and 6 seem similar. Let’s analyze the histogram with the y scale changed to log.
In the plot above, we can clearly see the counts for the different bins of free sulfur dioxide values across the quality grades. There are very few data points for quality 3, 4 and 8, so we cannot rely on those distributions. For quality 5, 6 and 7, however, there doesn’t seem to be any shift in trend.
Also, per the dataset description, free SO2 above 50 ppm becomes evident in the taste, so it should affect wine quality negatively. Unfortunately, there are not many data points with free SO2 > 50 ppm, so we can’t confirm this statement with the available data. Within the range below 50 ppm, given the observations above, we can’t identify any relation between free sulfur dioxide concentration and quality either.
Again, we see a trend in the median similar to the one above. This can be very misleading. Let’s look at the histogram.
As expected, due to the lack of data points, the trend we saw in the box plots is not reliable. Neither free sulfur dioxide nor total sulfur dioxide seems to have an impact on quality. We are not able to identify the best range of values for these variables because of the lack of data points and the overlap of ranges. My guess is that even a small quantity is enough to prevent microbial growth and oxidation.
From the box plots above, lower density seems to be better, but there is a huge overlap of data points across all quality grades. Let’s look at the histogram.
We can’t use quality 3, 4 and 8 due to insufficient data points. However, we can still see the downward shift in the median across quality grades 5, 6 and 7.
There is significant overlap of pH values across all quality grades. However, there seems to be a small downward trend in median pH, so lower pH seems to relate to higher quality. Specifically, 3 to 3.5 seems to be a good range for quality wine.
For quality 5, 6 and 7, we can see a small downward trend in pH values as quality improves. While lower pH seems to be better, I wouldn’t expect a very acidic wine to be pleasant. With more data points across quality grades, we could probably have identified the best pH range for wine.
The interquartile ranges are overshadowed by the outliers. It would be good to limit the y-axis.
Sulphates contribute to SO2, which acts as an antimicrobial and antioxidant. However, SO2 itself was found to have no significant impact on quality. In contrast, higher sulphate concentrations seem to lead to higher quality. Specifically, 0.7 to 0.8 units seems to be a good concentration for quality wine. But can we trust the range from this plot? Not yet; let’s look at the histogram.
There are very few data points. The best range that we determined from quality 8 is not valid because of the lack of data points. For quality 5, 6 and 7, however, the range seems to be 0.5 to 1.
From the graphs above, a higher alcohol percentage seems to be good for wine. Quality rises quite discernibly with the median alcohol content (except for quality 5, which could be affected by other variables).
The above observation can be confirmed from the histograms for quality 5, 6 and 7.
From the above univariate analysis, citric acid, sulphates and alcohol concentration seem to be positively related to quality, while volatile acidity, density and pH seem to be negatively related to it. Correlation could be a good metric for ranking the influence of these factors on quality.
with(wine, cor(quality, alcohol))
## [1] 0.4761663
with(wine, cor(quality, volatile.acidity))
## [1] -0.3905578
with(wine, cor(quality, sulphates))
## [1] 0.2513971
with(wine, cor(quality, citric.acid))
## [1] 0.2263725
with(wine, cor(quality, density))
## [1] -0.1749192
with(wine, cor(quality, pH))
## [1] -0.05773139
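The individual cor() calls above can be collapsed into one ranked table. Because quality is an ordinal score, a rank-based (Spearman) coefficient is a reasonable companion to the default Pearson one; including it is a suggestion here, not part of the original analysis. A minimal sketch on simulated data with only three predictors, so the numbers differ from those above:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(123)
n <- 500
alcohol          <- rnorm(n, mean = 10.4, sd = 1.1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
sulphates        <- rnorm(n, mean = 0.66, sd = 0.17)
# Quality loosely driven by two predictors, rounded and clipped to 3..8.
quality <- pmin(8, pmax(3, round(
  5.6 + 0.4 * (alcohol - 10.4) - 1.5 * (volatile.acidity - 0.53) +
    rnorm(n, sd = 0.7)
)))
wine <- data.frame(quality, alcohol, volatile.acidity, sulphates)

predictors <- setdiff(names(wine), "quality")

# Correlation of every predictor with quality, two ways.
pearson  <- sapply(wine[predictors], cor, y = wine$quality)
spearman <- sapply(wine[predictors], cor, y = wine$quality, method = "spearman")

# Rank predictors by absolute Pearson correlation, strongest first.
ranking <- data.frame(pearson, spearman)[order(-abs(pearson)), ]
print(round(ranking, 3))
```

Sorting by absolute value keeps strong negative relationships, such as volatile acidity, near the top of the list.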
We analyzed each variable against quality using box plots and histograms. Many variables showed significant overlap in their distributions across quality grades. While we previously faceted the histograms by quality, it would be good to analyze all grades together on the same plot, distinguished by color. This can give us insight into patterns across quality grades for each variable.
In the dataset, there are very few data points for quality 3, 4 and 8. While we compare the patterns for all quality grades, it would be good to focus on patterns within variables for quality grades 5, 6 and 7.
Now, let’s analyze patterns within each variable for quality 5, 6 and 7.
All quality grades overlap significantly over the range of the variable. This graph will not help deduce any relationship between the variable and quality.
Also, it looks like we are repeating the analysis we did with box plots, since this is just another way to plot the same data with the axes changed. As such, we can plot all variables together and look for any new patterns.
From the above graphs, we can conclude the following.
* Variables with positive relation to quality: Fixed acidity, citric acid, sulphates and alcohol.
* Variables with negative relation to quality: Volatile acidity, density. Also, pH seems to be showing a small negative trend.
Let’s enhance the peaks by plotting on a log scale to see the shifts in the peaks.
The log-scale plots are more helpful in confirming the trends observed above.
Now, let’s try to find the best range of each variable for the best quality wine, keeping in mind that there are too few data points to establish any significant patterns.
The plots look quite uniform due to the lack of data points for quality grade 8. It would be good to consider quality grades 7 and 8 together as the higher quality group.
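Pooling grades 7 and 8 amounts to subsetting on quality >= 7. A minimal sketch on simulated data; using the middle 50% of the pooled values as a numeric "good range" is an assumption made for illustration, whereas the analysis here reads its ranges off the plots:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(11)
quality <- rep(3:8, times = c(10, 50, 680, 640, 200, 20))
wine <- data.frame(
  quality = quality,
  alcohol = round(runif(length(quality), min = 8.4, max = 14.9), 1)
)

# Pool grades 7 and 8 into one "high quality" subset.
high_quality <- subset(wine, quality >= 7)

# One possible numeric range: the middle 50% of the pooled values.
alcohol_range <- quantile(high_quality$alcohol, probs = c(0.25, 0.75))

print(nrow(high_quality))
print(alcohol_range)
```

The pooled subset has many more rows than grade 8 alone, which is what makes its histograms less uniform.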
The above distributions of variables for quality grades 7 and 8 help narrow the ranges.
* Fixed acidity seems to be better between 7 and 10 units.
* Volatile acidity seems to be better between 0.25 and 0.5 units.
* Citric acid concentration seems to be better between 0.3 and 0.5 units.
* Residual sugars are better between 2 and 2.5 units.
* Chlorides are better between 0.05 and 0.8 units.
* The free sulfur dioxide distribution decays from 5 to 40, with a good range between 5 and 15.
* The total sulfur dioxide distribution decays from 10 to 100, with a good range between 10 and 40.
* Density seems to be better between 0.994 and 0.998.
* pH seems to be better between 3.2 and 3.4.
* Sulphates seem to be good between 0.6 and 0.9.
* Alcohol content seems to be good between 10 and 12.
From the univariate analysis, we identified a set of variables that impact quality. Let’s analyze how the variables in this set affect each other.
From the pairwise investigation, the following are observed.
* The variables above were identified by the univariate analysis as having an impact on quality. However, none of them is strongly correlated with quality.
* Fixed.acidity seems to be negatively correlated with volatile.acidity and pH.
* Also, fixed.acidity is positively correlated with citric.acid and density.
* While fixed.acidity is correlated with all factors affecting quality, it surprisingly has no significant impact on quality itself.
* Alcohol content seems to be negatively correlated with density.
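The pairwise relationships listed above can also be tabulated numerically with cor() on the whole data frame. A minimal sketch on simulated data, in which pH and density are tied to fixed.acidity purely for illustration; the name pair_table is hypothetical:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(2024)
n <- 400
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
wine <- data.frame(
  fixed.acidity = fixed.acidity,
  pH      = 3.3 - 0.06 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.11),
  density = 0.9967 + 0.0007 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.0014),
  alcohol = rnorm(n, mean = 10.4, sd = 1.1)
)

# Full pairwise Pearson correlation matrix, rounded for reading.
cor_matrix <- round(cor(wine), 2)

# Flatten the upper triangle into one row per variable pair.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Strongest relationships first, regardless of sign.
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table, row.names = FALSE)
```

A sorted pair table makes it easy to pick which pairs deserve a closer look in scatter plots.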
From the above observations, let’s explore the related variables.
From the above graph, fixed.acidity and citric.acid seem to have a linear relationship. However, the ratio seems to be similar across quality grades. Let’s analyze the ratio of citric.acid to fixed.acidity across quality using box plots.
The above graph shows that although the ratio increases with quality, it is very similar across grades and changes only marginally from one grade to the next.
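How marginal the change is can be quantified by taking the median ratio per grade and differencing adjacent grades. A minimal sketch on simulated data; the column name acid.ratio is hypothetical:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(5)
quality <- rep(3:8, times = c(10, 50, 680, 640, 200, 20))
n <- length(quality)
wine <- data.frame(
  quality       = quality,
  fixed.acidity = runif(n, min = 4.6, max = 15.9),
  citric.acid   = runif(n, min = 0, max = 1)
)

# Ratio of citric acid to fixed acidity for every wine.
wine$acid.ratio <- wine$citric.acid / wine$fixed.acidity

# Median ratio within each quality grade, ordered by grade.
ratio_medians <- aggregate(acid.ratio ~ quality, data = wine, FUN = median)
print(ratio_medians)

# Change in the median ratio from each grade to the next.
median_steps <- diff(ratio_medians$acid.ratio)
print(round(median_steps, 4))
```

Small steps relative to the spread within each grade indicate that the ratio separates the grades poorly.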
Similarly, fixed.acidity and density seem to have a linear relationship, but the ratio is similar across quality grades.
There is no discernible relationship between fixed.acidity and volatile.acidity.
Fixed.acidity and pH seem to have a linear, although negative, relationship. But the ratio seems to be similar across quality grades.
Another relation observed above was that alcohol and density are roughly complementary variables. It would be good to investigate how their difference correlates with quality.
Again, like the other pairs investigated above, alcohol and density seem to have a negative relationship, but the ratio is quite similar across quality grades.
While the pairs of variables above gave more insight, they didn’t help much towards devising a good model to predict quality. Let’s see how groups of positively and negatively related variables correlate with quality.
with(wine,
cor(quality, citric.acid+alcohol+sulphates))
## [1] 0.520615
with(wine,
cor(quality, volatile.acidity+pH+density))
## [1] -0.3020624
with(wine,
cor(quality, citric.acid+alcohol+sulphates-volatile.acidity-pH-density))
## [1] 0.5535468
The combined score built from all six variables is more strongly correlated with quality (0.55) than any individual variable (at most 0.48, for alcohol).
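One caveat with summing raw values is that the variables sit on very different scales: in the summary table, alcohol spans 8.4 to 14.9 while density spans only 0.9901 to 1.0037, so the raw sum is dominated by alcohol. Standardizing each variable first is one alternative. The sketch below uses simulated data and is not the method used in this analysis; the names signs, raw_score and z_score are hypothetical:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(123)
n <- 500
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  citric.acid      = pmax(0, rnorm(n, mean = 0.27, sd = 0.19)),
  sulphates        = rnorm(n, mean = 0.66, sd = 0.17),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  pH               = rnorm(n, mean = 3.31, sd = 0.15),
  density          = rnorm(n, mean = 0.9967, sd = 0.002)
)
wine$quality <- 5.6 + 0.35 * (wine$alcohol - 10.4) -
  1.2 * (wine$volatile.acidity - 0.53) + rnorm(n, sd = 0.7)

# +1 for variables added to the score, -1 for variables subtracted.
signs <- c(alcohol = 1, citric.acid = 1, sulphates = 1,
           volatile.acidity = -1, pH = -1, density = -1)
vars <- names(signs)

raw_score <- as.vector(as.matrix(wine[vars]) %*% signs)  # sum of raw values
z_score   <- as.vector(scale(wine[vars]) %*% signs)      # sum of z-scores

print(round(sapply(wine[vars], sd), 3))   # spread of each variable
print(cor(raw_score, wine$alcohol))       # how much alcohol drives the raw sum
print(c(raw          = cor(wine$quality, raw_score),
        standardized = cor(wine$quality, z_score)))
```

With z-scores every variable contributes on an equal footing, which makes the combined correlation easier to interpret.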
Let’s build a linear model for predicting the quality of wine based on the above observations.
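# Start from alcohol alone and add the positively related variables one at a time.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, ~ . + citric.acid)
m3 <- update(m2, ~ . + sulphates)
# Caution: in a formula, "-" removes a term; it does not add it with a negative sign.
# None of these terms is in m3, so m4, m5 and m6 are identical to m3 (see the table below).
m4 <- update(m3, ~ . - volatile.acidity)
m5 <- update(m4, ~ . - pH)
m6 <- update(m5, ~ . - density)
mtable(m1, m2, m3, m4, m5, m6)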
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + citric.acid, data = wine)
## m3: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## m4: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## m5: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
## m6: lm(formula = quality ~ alcohol + citric.acid + sulphates, data = wine)
##
## ======================================================================================================
## m1 m2 m3 m4 m5 m6
## ------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.830*** 1.434*** 1.434*** 1.434*** 1.434***
## (0.175) (0.171) (0.176) (0.176) (0.176) (0.176)
## alcohol 0.361*** 0.346*** 0.338*** 0.338*** 0.338*** 0.338***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.016)
## citric.acid 0.730*** 0.513*** 0.513*** 0.513*** 0.513***
## (0.090) (0.093) (0.093) (0.093) (0.093)
## sulphates 0.814*** 0.814*** 0.814*** 0.814***
## (0.107) (0.107) (0.107) (0.107)
## ------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.257 0.284 0.284 0.284 0.284
## adj. R-squared 0.226 0.256 0.282 0.282 0.282 0.282
## sigma 0.710 0.696 0.684 0.684 0.684 0.684
## F 468.267 276.595 210.501 210.501 210.501 210.501
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1688.711 -1659.955 -1659.955 -1659.955 -1659.955
## Deviance 805.870 773.917 746.576 746.576 746.576 746.576
## AIC 3448.114 3385.421 3329.910 3329.910 3329.910 3329.910
## BIC 3464.245 3406.930 3356.795 3356.795 3356.795 3356.795
## N 1599 1599 1599 1599 1599 1599
## ======================================================================================================
Even the best of these models explains only about 28% of the variance in quality (R-squared of 0.284), which is too low for a meaningful linear model.
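The m4 to m6 columns in the table above match m3 exactly because, in an R formula, - removes a term rather than adding it with a negative sign. To include a negatively related variable, it is added with + and lm() estimates the sign of its coefficient. A minimal sketch on simulated data, so the coefficients do not correspond to the real dataset; the model names are hypothetical:

```r
# Simulated stand-in for the wine data (illustrative values only).
set.seed(99)
n <- 600
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.1),
  sulphates        = rnorm(n, mean = 0.66, sd = 0.17),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
wine$quality <- 5.6 + 0.35 * (wine$alcohol - 10.4) -
  1.2 * (wine$volatile.acidity - 0.53) + rnorm(n, sd = 0.7)

m_base  <- lm(quality ~ alcohol + sulphates, data = wine)
m_minus <- update(m_base, ~ . - volatile.acidity)  # removes a term that is not there
m_plus  <- update(m_base, ~ . + volatile.acidity)  # actually adds the term

print(names(coef(m_minus)))   # same terms as m_base
print(names(coef(m_plus)))    # now includes volatile.acidity
print(round(coef(m_plus), 3))
print(c(base = summary(m_base)$r.squared, plus = summary(m_plus)$r.squared))
```

Refitting the real models this way would show whether volatile acidity, pH and density add explanatory power beyond the three positively related variables.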
Observations: There are very few data points for the best and worst quality grades; most of the data points are mid-quality. This makes it very difficult to devise a good predictive model for quality.
Observations: Data points for quality grades 5 and 6 dominated every variable. There also seem to be either similar patterns or a huge overlap of data points across quality grades. As such, no individual variable was good enough to build a model that predicts wine quality. However, the above plots helped to generate a range of values for good quality wine. Because of the lack of data points for the best quality (8) wine, combining quality grades 7 and 8 helped to generate the best ranges of values for all variables.
Multivariate analysis of different pairs of variables gave better insight into how the variables are related to each other. However, the pattern of each relationship remained quite similar across quality grades. One such example is shown below.
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0006024 0.0049172 0.0163683 0.0320490 0.0568965
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.003797 0.013235 0.020776 0.033750 0.108696
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01270 0.02778 0.02844 0.04167 0.09870
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01250 0.03411 0.03112 0.04507 0.13929
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03502 0.04476 0.04050 0.05083 0.08831
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005455 0.043042 0.046917 0.043204 0.054709 0.059574
During the multivariate analysis, it was found that citric.acid and fixed.acidity are correlated quite linearly. However, this linear relationship was quite similar across quality grades. As can be seen above, the slope/ratio of the variables changed by about 0.01 on average between quality grades. The same held for the other pairs of variables analyzed. Thus, the multivariate analysis also didn’t help to generate a good model for predicting wine quality.
Although individual variables and pairs didn’t yield a good model, combining all the significant variables produced a better one. Even so, the R-squared value of this model was not good enough for predicting wine quality.
The red wine data set contains about 1600 data points with 11 quantitative variables. The output is a qualitative variable. I started by understanding the distribution of the output (quality) variable. The number of data points was far from uniform across quality grades: there were many data points for quality grades 5 and 6 and very few for quality grades 3, 4 and 8.
With these insights about quality in mind, I moved on to exploring each individual variable across quality grades. As expected from the data set description, alcohol content, sulphates and citric acid seemed to have a positive relationship with quality. On the other hand, volatile.acidity, pH and density seemed to show a negative relationship with quality. One interesting discovery was that while fixed.acidity is positively related to the variables that impacted quality positively and negatively related to the variables that impacted quality negatively, fixed.acidity itself didn’t have any discernible relationship with quality. Also, while all these variables showed a visible relationship, their data points overlap heavily across quality grades. This brought down the correlation of every variable with quality.
Then, I proceeded to analyze the distributions of the variables to understand the best range of values for each one. The histogram plots with all quality grades showed a good range. I then went on to narrow down the ranges for quality grade 8 alone. However, due to the lack of data points, the variable ranges looked uniform. Because of this, I decided to analyze grades 7 and 8 together as a combined higher quality group. This helped to establish a stronger picture of good quality wine with enough data points. It also helped to narrow down the range of each individual variable for good quality wine.
After the univariate analysis, I proceeded with multivariate analysis of the dataset. Although the variables above showed some correlation with quality, none of the correlations was strong enough. However, a few pairs of variables showed visible correlation in the scatter plots and correlation values. When I analyzed these pairs against quality, the pairs still had significant overlap of data points between quality grades. Thus, while the visualizations established linear relationships, the model didn’t perform any better.
It would be good to get more data points for higher and lower quality wines and then analyze the relationships again. Also, each variable seems to have a huge variance in its values within the same grade. This led to a huge overlap of data points between quality grades, making it difficult to discern a good predictive model. A more granular labeling of quality would help establish a better model.